In the following exercises, we will use the data you collected in the previous session (all comments for the video “The Census” by Last Week Tonight with John Oliver). You might have to adjust the following code to use the correct file path on your computer.
comments <- readRDS("../data/ParsedComments.rds")
Next, we go through the preprocessing steps described in the slides. As a first step, we remove newline characters from the comment strings (the version without emojis).
library(tidyverse)
comments <- comments %>%
  mutate(TextEmojiDeleted = str_replace_all(TextEmojiDeleted,
                                            pattern = "\\\n",
                                            replacement = " "))
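To illustrate what this replacement does, here is a minimal, self-contained sketch on made-up comment strings (the texts are invented for demonstration only):

```r
library(stringr)

# made-up comment strings containing newline characters
toy_comments <- c("Great video!\nVery informative.", "No newline here")

# replace each newline with a space, as in the pipeline above
str_replace_all(toy_comments, pattern = "\n", replacement = " ")
# [1] "Great video! Very informative." "No newline here"
```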
Next, we tokenize the comments and create a document-feature matrix from which we remove English stopwords.
library(quanteda)
toks <- comments %>%
  pull(TextEmojiDeleted) %>%
  char_tolower() %>%
  tokens(remove_numbers = TRUE,
         remove_punct = TRUE,
         remove_separators = TRUE,
         remove_symbols = TRUE,
         remove_url = TRUE,
         split_hyphens = TRUE) # replaces the deprecated remove_hyphens argument

comments_dfm <- dfm(toks) %>%
  dfm_remove(pattern = quanteda::stopwords("english")) # the remove argument of dfm() is deprecated
Compute the term frequencies for the comments and store the result in an object called term_freq. You can use textstat_frequency() from the quanteda package to answer this question.
Next, have a look at the document frequencies by extracting the column docfreq from the term_freq object you created in the previous task.
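As a sketch of how textstat_frequency() works, the example below builds a tiny DFM from made-up texts (assuming quanteda v3, where the function lives in the companion package quanteda.textstats):

```r
library(quanteda)
library(quanteda.textstats)  # home of textstat_frequency() since quanteda v3

# made-up stand-in texts for the comments
toy_texts <- c("the census matters", "the census counts everyone", "counts matter")
toy_dfm <- dfm_remove(dfm(tokens(toy_texts)),
                      pattern = stopwords("english"))

term_freq <- textstat_frequency(toy_dfm)
term_freq          # data frame with columns feature, frequency, rank, docfreq, group
term_freq$docfreq  # number of documents (comments) each term occurs in
```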
We also want to look at the emojis that were used in the comments on the video “The Census” by Last Week Tonight with John Oliver. Similar to what we did for the comment text without emojis, we first need to wrangle the data (remove missing values, tokenize the emojis, create a DFM).
emoji_toks <- comments %>%
  mutate(Emoji = na_if(Emoji, "NA")) %>% # treat the literal string "NA" as a missing value
  mutate(Emoji = str_trim(Emoji)) %>%
  filter(!is.na(Emoji)) %>%
  pull(Emoji) %>%
  tokens()

EmojiDfm <- dfm(emoji_toks)
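To check the resulting DFM, topfeatures() lists the most frequent emojis. The snippet below is a self-contained sketch; it assumes (this is an assumption, not taken from the source data) that the Emoji column holds one space-separated string of emoji descriptions per comment, and the values are made up:

```r
library(quanteda)

# made-up stand-in for the Emoji column: one space-separated string
# of emoji descriptions per comment
toy_emoji <- c("emoji_grinningface emoji_redheart", "emoji_grinningface")

toy_emoji_dfm <- dfm(tokens(toy_emoji))
topfeatures(toy_emoji_dfm)  # named vector of the most frequent emojis
```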
To map the emojis to their textual descriptions, we use the function defined in the emoji_mapping_function.R file.
source("../scripts/emoji_mapping_function.R")